This paper is based on Wald Lectures given at the annual meeting
of the IMS in Minneapolis during August 2005. It is a survey of the
theory of large deviations.

1. Large deviations for sums. The role of "large deviations" is best
understood through an example. Suppose that $X_1, X_2, \ldots, X_n, \ldots$ is a
sequence of i.i.d. random variables, for instance, normally distributed with
mean zero and variance 1. Then,

$$E[e^{\theta(X_1+\cdots+X_n)}] = (E[e^{\theta X_1}])^n = e^{n\theta^2/2}.$$

On the other hand, writing $S_n = X_1+\cdots+X_n$,

$$E[e^{\theta(X_1+\cdots+X_n)}] = E[e^{n\theta(S_n/n)}].$$

Since, by the law of large numbers, $S_n/n$ is nearly zero, we have

$$E[e^{\theta(X_1+\cdots+X_n)}] = E[e^{o(n)}] = e^{o(n)},$$

in apparent conflict with the exact value $e^{n\theta^2/2}$.

There is, of course, a very simple explanation for this. In computing
expectations of random variables that can assume large values with small
probabilities, contributions from such values cannot be ignored. After all,
a product of something big and something small can still be big! In our
case, assuming $\theta > 0$, for any $a > 0$,

$$E[e^{\theta S_n}] \ge e^{n\theta a}\, P[S_n \ge na] = e^{n\theta a}\, e^{-(na^2/2)+o(n)} = e^{n(a\theta - a^2/2)+o(n)}.$$

Since $a > 0$ is arbitrary,

$$E[e^{\theta S_n}] \ge e^{n \sup_{a>0}(\theta a - a^2/2) + o(n)} = e^{n\theta^2/2 + o(n)},$$

which is the correct answer.
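The two ingredients of this argument can be checked numerically. The following Python sketch is an illustration only and is not part of the original lectures; the function names are hypothetical. It evaluates $\frac{1}{n}\log P[S_n \ge na]$ exactly for Gaussian summands, using the fact that $S_n/\sqrt{n}$ is standard normal, and approximates $\sup_{a>0}(\theta a - a^2/2)$ on a grid.

```python
import math

def gaussian_tail_rate(n, a):
    """(1/n) * log P[S_n >= n*a] for S_n a sum of n i.i.d. N(0,1) variables.

    S_n / sqrt(n) is N(0,1), so the probability is 0.5 * erfc(a * sqrt(n / 2)).
    """
    return math.log(0.5 * math.erfc(a * math.sqrt(n / 2.0))) / n

def sup_over_a(theta, grid_size=20001, a_max=10.0):
    """Grid approximation of sup_{a > 0} (theta * a - a^2 / 2)."""
    best = -math.inf
    for i in range(1, grid_size):
        a = a_max * i / (grid_size - 1)
        best = max(best, theta * a - 0.5 * a * a)
    return best

if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The values approach -a^2/2 as n grows (here a = 1).
        print(n, gaussian_tail_rate(n, 1.0))
    # The grid supremum should agree with theta^2 / 2.
    print(sup_over_a(1.5), 1.5 ** 2 / 2)
```

The tail rate converges slowly because of the $o(n)$ correction, which is of order $\log n$ here.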

The simplest example for which one can calculate probabilities of large
deviations is coin tossing. The probability of $k$ heads in $n$ tosses of a
fair coin is

$$P(n,k) = \binom{n}{k} 2^{-n} = \frac{n!}{k!\,(n-k)!}\, 2^{-n},$$

which, using Stirling's approximation, is

$$\simeq \frac{\sqrt{2\pi}\, e^{-n}\, n^{n+1/2}\, 2^{-n}}{\sqrt{2\pi}\, e^{-(n-k)} (n-k)^{n-k+1/2}\; \sqrt{2\pi}\, e^{-k}\, k^{k+1/2}},$$

so that

$$\log P(n,k) \simeq -\tfrac12 \log(2\pi) + \left(n+\tfrac12\right)\log n - \left(n-k+\tfrac12\right)\log(n-k) - \left(k+\tfrac12\right)\log k - n\log 2$$

$$= -\tfrac12 \log(2\pi) - \tfrac12 \log n - \left(n-k+\tfrac12\right)\log\left(1-\frac{k}{n}\right) - \left(k+\tfrac12\right)\log\frac{k}{n} - n\log 2.$$

If $k \simeq nx$, then

$$\log P(n,k) \simeq -n[\log 2 + x\log x + (1-x)\log(1-x)] + o(n) = -nH(x) + o(n),$$

where $H(x)$ is the Kullback-Leibler information or relative entropy of
Binomial$(x, 1-x)$ with respect to Binomial$(\frac12, \frac12)$.
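The quality of this approximation is easy to inspect numerically. The sketch below is an illustration rather than part of the original text, and the function names are hypothetical. It computes $\log P(n,k)$ exactly through the log-gamma function and compares $-\frac{1}{n}\log P(n,k)$ with $H(x)$ for $k = nx$.

```python
import math

def log_binomial_prob(n, k):
    """Exact log P(n, k) = log[C(n, k) 2^{-n}], computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1) - n * math.log(2.0))

def entropy_rate(x):
    """H(x) = log 2 + x log x + (1 - x) log(1 - x) for 0 < x < 1."""
    return math.log(2.0) + x * math.log(x) + (1.0 - x) * math.log(1.0 - x)

if __name__ == "__main__":
    x = 0.3
    for n in (100, 10000, 1000000):
        k = int(round(n * x))
        # The two columns agree up to a correction of order (log n) / n.
        print(n, -log_binomial_prob(n, k) / n, entropy_rate(x))
```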

This is not a coincidence. In fact, if $f_i$ are the observed frequencies in
$n$ trials of a multinomial with probabilities $\{p_i\}$ for the individual
cells, then

$$P(n, p_1, \ldots, p_k; f_1, \ldots, f_k) = \frac{n!}{f_1! \cdots f_k!}\, p_1^{f_1} \cdots p_k^{f_k}.$$

A similar calculation using Stirling's approximation yields, assuming
$f_i \simeq n x_i$,

$$\log P(n, p_1, \ldots, p_k; f_1, \ldots, f_k) = -nH(x_1, \ldots, x_k \mid p_1, \ldots, p_k) + o(n),$$

where $H(x,p)$ is again the Kullback-Leibler information number

$$H(x,p) = \sum_{i=1}^{k} x_i \log\frac{x_i}{p_i}.$$

Any probability distribution can be approximated by one that is
concentrated on a finite set and the empirical distribution from a sample
of size $n$ will then have a multinomial distribution. One therefore expects
that the probability $P(n,\alpha,\beta)$ that the empirical distribution

$$\frac{1}{n}\sum_{r=1}^{n} \delta_{x_r}$$

of $n$ independent observations $x_1, \ldots, x_n$ from a distribution $\alpha$
is close to $\beta$ should satisfy

$$\log P(n,\alpha,\beta) = -nH(\beta,\alpha) + o(n),$$

where $H(\beta,\alpha)$ is again the Kullback-Leibler information number

$$H(\beta,\alpha) = \int \log\frac{d\beta}{d\alpha}\, d\beta = \int \frac{d\beta}{d\alpha}\log\frac{d\beta}{d\alpha}\, d\alpha.$$

This theorem, proven by Sanov, must be made precise. This requires a formal
definition of what is meant by large deviations. We have a family $\{P_n\}$
of probability distributions on some space $X$ which we assume to be a
complete separable metric space. There is a sequence of numbers
$a_n \to \infty$ which we might as well assume to be $n$. Typically, $P_n$
concentrates around a point $x_0 \in X$ and, for sets $A$ away from $x_0$,
$P_n(A)$ tends to zero exponentially rapidly in $n$, that is,

$$\log P_n(A) \simeq -n\,c(A),$$

where $c(A) > 0$ if $x_0 \notin A$.

We say that a large deviation principle holds for a sequence of probability
measures $P_n$ defined on the Borel subsets of a Polish (complete separable
metric) space $X$, with a rate function $H(x)$, if:

(1) $H(x) \ge 0$ is a lower semicontinuous function on $X$ with the property
that $K_\ell = \{x : H(x) \le \ell\}$ is a compact set for every $\ell < \infty$;

(2) for any closed set $C \subset X$,

$$\limsup_{n\to\infty} \frac{1}{n}\log P_n(C) \le -\inf_{x\in C} H(x);$$

(3) for any open set $U \subset X$,

$$\liminf_{n\to\infty} \frac{1}{n}\log P_n(U) \ge -\inf_{x\in U} H(x).$$

While condition (1) is not really necessary for the validity of the large
deviation principle, it is a useful condition on the rate function that
will allow us to reduce the analysis to what happens on compact sets. Rate
functions with this property are referred to as "good" rate functions.

For Sanov's theorem, the i.i.d. random variables will be taking values in
a complete separable metric space $X$. Their common distribution will be a
probability measure $\alpha \in \mathcal{M} = \mathcal{M}(X)$, the set of
probability measures on $X$. The space $\mathcal{M}$ under weak convergence
is a complete separable metric space. The empirical distributions will be
random variables with values in $\mathcal{M}$, and $P_n$ will be their
distribution, which will therefore be a probability measure on
$\mathcal{M}$. The rate function

$$H(\beta,\alpha) = \int \frac{d\beta}{d\alpha}\log\frac{d\beta}{d\alpha}\, d\alpha$$

will be defined to be $+\infty$ unless $\beta \ll \alpha$ and the
Radon-Nikodym derivative $f = \frac{d\beta}{d\alpha}$ is such that $f\log f$
is integrable with respect to $\alpha$.

2. Rate functions, duality and generating functions. We start with i.i.d.
random variables and look at the sample mean

$$\frac{S_n}{n} = \frac{X_1 + \cdots + X_n}{n}.$$

According to a theorem of Cramér, its distribution $P_n$ satisfies a large
deviation principle with a rate function $h(a)$ given by

$$h(a) = \sup_{\theta}\,[\theta a - \log E[e^{\theta X}]].$$
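For a fair coin with values in $\{0,1\}$ the supremum can be evaluated numerically and compared with the entropy expression found for coin tossing in Section 1. The Python sketch below is an illustration, not part of the original text; the function names and the use of ternary search are choices made here. Ternary search applies because the objective is concave in $\theta$.

```python
import math

def log_mgf_fair_coin(theta):
    """log E[e^{theta X}] for X in {0, 1} with probability 1/2 each."""
    return math.log(0.5 + 0.5 * math.exp(theta))

def legendre_transform(a, lo=-50.0, hi=50.0, iters=200):
    """sup_theta [theta * a - log E e^{theta X}] by ternary search.

    The objective is concave in theta, so ternary search converges
    to the maximiser as long as it lies in [lo, hi].
    """
    def objective(theta):
        return theta * a - log_mgf_fair_coin(theta)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def coin_entropy(a):
    """log 2 + a log a + (1 - a) log(1 - a), the rate found by Stirling's formula."""
    return math.log(2.0) + a * math.log(a) + (1.0 - a) * math.log(1.0 - a)

if __name__ == "__main__":
    for a in (0.1, 0.3, 0.5, 0.8):
        print(a, legendre_transform(a), coin_entropy(a))
```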

In general, if a large deviation principle is valid for probability measures
$P_n$ on a space $X$ with some rate function $H(x)$, it is not hard to see
that, under suitable conditions on the function $F$ (boundedness and
continuity will suffice),

$$\frac{1}{n}\log\int e^{nF(x)}\, dP_n \to \sup_{x}\,[F(x) - H(x)].$$

The basic idea is just

$$\frac{1}{n}\log\sum_{i} e^{n a_i} \to \sup_{i} a_i.$$

In other words, the logarithms of generating functions are dual to, or
Legendre transforms of, large deviation rate functions. For instance,

$$\log\int e^{V(x)}\, d\alpha = \sup_{\beta}\left[\int V(x)\, d\beta - H(\beta,\alpha)\right].$$

If the rate function is convex, as it is in the case of sums of independent
identically distributed random variables, the duality relationship is
invertible and

$$H(\beta,\alpha) = \sup_{V(\cdot)}\left[\int V(x)\, d\beta - \log\int e^{V(x)}\, d\alpha\right],$$

where the supremum on $V$ in either case is taken over all bounded
measurable functions or bounded continuous functions.
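On a finite set the first duality formula can be verified directly, since the supremum over $\beta$ is attained at the tilted measure $\beta^*(i) \propto \alpha(i)e^{V(i)}$. The following Python sketch is an illustration only; the function names are hypothetical and the finite state space is an assumption made to keep the example computable.

```python
import math

def relative_entropy(beta, alpha):
    """H(beta, alpha) = sum_i beta_i log(beta_i / alpha_i), with 0 log 0 = 0."""
    return sum(b * math.log(b / a) for b, a in zip(beta, alpha) if b > 0.0)

def log_generating_function(V, alpha):
    """log of the integral of e^V with respect to alpha on a finite set."""
    return math.log(sum(a * math.exp(v) for v, a in zip(V, alpha)))

def tilted_measure(V, alpha):
    """beta_i proportional to alpha_i e^{V_i}; this attains the supremum."""
    weights = [a * math.exp(v) for v, a in zip(V, alpha)]
    total = sum(weights)
    return [w / total for w in weights]

def functional(V, beta, alpha):
    """The quantity inside the supremum: integral of V d(beta) minus H(beta, alpha)."""
    return sum(v * b for v, b in zip(V, beta)) - relative_entropy(beta, alpha)

if __name__ == "__main__":
    alpha = [0.2, 0.3, 0.5]
    V = [1.0, -0.5, 0.3]
    best = tilted_measure(V, alpha)
    print(functional(V, best, alpha), log_generating_function(V, alpha))
    # Any other beta gives a smaller value.
    print(functional(V, [1 / 3, 1 / 3, 1 / 3], alpha))
```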
3. Markov processes. This relationship for i.i.d. sequences can be
extended to the Markovian context. For simplicity, let us assume that we
have a finite state space $X$ and transition probabilities $\pi(x,y)$ of a
Markov chain on $X$. Let us suppose that $\pi(x,y) > 0$ for all $x, y$. If
$V(\cdot) : X \to R$ is a function on $X$, then

$$E_x[\exp[V(X_1) + V(X_2) + \cdots + V(X_n)]]$$

can be explicitly evaluated as

$$\sum_{y} \pi_V^{(n)}(x,y),$$

where $\pi_V(x,y) = \pi(x,y)e^{V(y)}$ and $\pi_V^{(n)}$ is the $n$th power of
$\pi_V$. Since $\pi_V$ is a matrix with positive entries,

$$\frac{1}{n}\log\sum_{y}\pi_V^{(n)}(x,y) \to \log\lambda_\pi(V),$$

where $\lambda_\pi(V)$ is the principal eigenvalue of $\pi_V$. The analog of
Sanov's theorem in this context establishes a large deviation result for
the empirical distributions of a long sequence $\{x_1, x_2, \ldots, x_n\}$.
They belong to the space $\mathcal{M}$ of probability measures on $X$. Let
us denote by $P$ the Markov process with transition probability $\pi(x,y)$
and by $Q_n$ the measure on $\mathcal{M}$ which is the distribution of the
empirical distribution of $\{x_1, x_2, \ldots, x_n\}$. The large deviation
upper bound for $Q_n$, with a rate function given by

$$H_\pi(q) = \sup_{V}\left[\sum_{x} V(x)q(x) - \log\lambda_\pi(V)\right],$$

is an easy consequence of estimates on the generating function.
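The convergence of the normalized logarithm of the generating function to the logarithm of the principal eigenvalue can be observed numerically. The Python sketch below is an illustration only; the function names, the use of power iteration and the small example chain are assumptions made here, not part of the original text.

```python
import math

def tilted_matrix(pi, V):
    """pi_V(x, y) = pi(x, y) * exp(V(y))."""
    size = len(V)
    return [[pi[x][y] * math.exp(V[y]) for y in range(size)] for x in range(size)]

def log_principal_eigenvalue(M, iters=500):
    """log of the Perron eigenvalue of a positive matrix, by power iteration."""
    size = len(M)
    v = [1.0] * size
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[x][y] * v[y] for y in range(size)) for x in range(size)]
        lam = max(w)                     # v is kept normalised to have maximum 1
        v = [wi / lam for wi in w]
    return math.log(lam)

def log_expectation_rate(pi, V, x, n_steps):
    """(1/n) log E_x[exp(V(X_1) + ... + V(X_n))], via repeated multiplication.

    The running vector is renormalised at every step to avoid overflow.
    """
    M = tilted_matrix(pi, V)
    size = len(V)
    u = [1.0] * size                     # u is proportional to M^k applied to ones
    log_scale = 0.0
    for _ in range(n_steps):
        u = [sum(M[a][b] * u[b] for b in range(size)) for a in range(size)]
        s = max(u)
        u = [ui / s for ui in u]
        log_scale += math.log(s)
    return (log_scale + math.log(u[x])) / n_steps

if __name__ == "__main__":
    pi = [[0.9, 0.1], [0.4, 0.6]]
    V = [0.5, -1.0]
    print(log_principal_eigenvalue(tilted_matrix(pi, V)))
    for n in (10, 100, 1000):
        print(n, log_expectation_rate(pi, V, 0, n))
```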

There is a more direct way of approaching $H_\pi$ through lower bounds. Let
us pretend that our Markov chain exhibits "atypical" behavior and behaves
like a different Markov chain, one with transition probability
$\tilde\pi(x,y)$. In other words, the empirical distribution of visits to
different sites of $X$ is close to $q$, which turns out to be the invariant
distribution for a different chain, one with transition probabilities
$\tilde\pi(x,y)$. We can estimate the probability of this event. Let $U$ be
an open set around $q$ in the space $\mathcal{M}$ of probability measures
on $X$. Let $A_n$ be the set of all realizations of $\{x_1, \ldots, x_n\}$
with empirical distributions belonging to $U$. We can estimate $P(A_n)$, the
probability of $A_n$ under the original $\pi(x,y)$ chain, by

$$P(A_n) = \sum_{x_1, x_2, \ldots, x_n \in A_n} \pi(x,x_1)\cdots\pi(x_{n-1},x_n)$$

$$= \sum_{x_1, x_2, \ldots, x_n \in A_n} \tilde\pi(x,x_1)\cdots\tilde\pi(x_{n-1},x_n)\exp\left[-\sum_{i=1}^{n}\log\frac{\tilde\pi(x_{i-1},x_i)}{\pi(x_{i-1},x_i)}\right]$$

$$= \int_{A_n}\exp\left[-\sum_{i=1}^{n}\log\frac{\tilde\pi(x_{i-1},x_i)}{\pi(x_{i-1},x_i)}\right] d\tilde P,$$

where $x_0 = x$ is the starting point and $\tilde P$ is the Markov chain with
transition probability $\tilde\pi(x,y)$. By the ergodic theorem,

$$\lim_{n\to\infty}\tilde P(A_n) = 1.$$

An application of Jensen's inequality yields

$$\liminf_{n\to\infty}\frac{1}{n}\log Q_n(U) \ge -\sum_{x,y}\tilde\pi(x,y)\log\frac{\tilde\pi(x,y)}{\pi(x,y)}\, q(x).$$

We can pick any $\tilde\pi$, provided $q$ is invariant for $\tilde\pi$, that
is, $q\tilde\pi = q$, and we will get an upper bound for $H_\pi(q)$.
Therefore,

$$H_\pi(q) \le \inf_{\tilde\pi :\, q\tilde\pi = q}\ \sum_{x,y}\tilde\pi(x,y)\log\frac{\tilde\pi(x,y)}{\pi(x,y)}\, q(x).$$

In fact, there is equality here and $H_\pi(q)$ is dual to $\log\lambda_\pi(V)$:

$$H_\pi(q) = \sup_{V(\cdot)}\left[\sum_{x} V(x)q(x) - \log\lambda_\pi(V)\right],$$

$$\log\lambda_\pi(V) = \sup_{q(\cdot)}\left[\sum_{x} V(x)q(x) - H_\pi(q)\right].$$

If we are dealing with a process in continuous time, we will have a matrix
$A$ of transition rates $\{a(x,y)\}$ with $a(x,y) \ge 0$ for $x \ne y$ and
$\sum_y a(x,y) = 0$. The transition probabilities $\{\pi(t,x,y)\}$ will be
given by $\pi(t,x,y) = (\exp tA)(x,y)$. If $a$ is symmetric, that is,
$a(x,y) = a(y,x)$, then so is $\pi(t,\cdot,\cdot)$. The uniform distribution
on $X$ will, in this case, be the invariant measure. The role of
$\log\lambda_\pi(V)$ in the previous discrete situation will now be played
by the principal eigenvalue $\lambda_a(V)$ of $A + V$, where
$(A+V)(x,y) = a(x,y) + \delta(x,y)V(y)$. Here, $I = \{\delta(x,y)\}$ is the
identity matrix. This is the conjugate of the rate function $H_a(q)$:

$$H_a(q) = \sup_{V(\cdot)}\left[\sum_{x} V(x)q(x) - \lambda_a(V)\right],$$

$$\lambda_a(V) = \sup_{q(\cdot)}\left[\sum_{x} V(x)q(x) - H_a(q)\right].$$

Note that, in the symmetric case, we have the usual variational formula

$$\lambda_a(V) = \sup_{u(\cdot):\ \sum_x [u(x)]^2 = 1}\left[\sum_{x} V(x)u^2(x) + \sum_{x,y} a(x,y)u(x)u(y)\right]$$

$$= \sup_{u(\cdot):\ \sum_x [u(x)]^2 = 1}\left[\sum_{x} V(x)u^2(x) - \frac12\sum_{x,y} a(x,y)(u(x)-u(y))^2\right]$$

$$= \sup_{q(\cdot)\ge 0:\ \sum_x q(x) = 1}\left[\sum_{x} V(x)q(x) - \frac12\sum_{x,y} a(x,y)\left(\sqrt{q(x)}-\sqrt{q(y)}\right)^2\right].$$

It is therefore not very surprising that

$$H_a(q) = \frac12\sum_{x,y} a(x,y)\left(\sqrt{q(x)}-\sqrt{q(y)}\right)^2.$$
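The symmetric case can be checked by hand on a two-state chain with jump rate $a$ in each direction, taking the rate function to be $\frac12\sum_{x,y}a(x,y)(\sqrt{q(x)}-\sqrt{q(y)})^2$, the expression appearing in the last supremum. The Python sketch below is an illustration only; the two-state example and the function names are assumptions made here. It compares the closed-form principal eigenvalue of $A+V$ with a grid approximation of the supremum over $q$.

```python
import math

def dirichlet_rate(q, a=1.0):
    """H_a(q) for the symmetric two-state chain with jump rate a in each direction.

    q is the mass at the first state. The double sum over ordered pairs (x, y)
    contributes the same term twice, which cancels the factor 1/2.
    """
    return a * (math.sqrt(q) - math.sqrt(1.0 - q)) ** 2

def principal_eigenvalue(V1, V2, a=1.0):
    """Largest eigenvalue of A + V for A = [[-a, a], [a, -a]], V = diag(V1, V2)."""
    mean = 0.5 * (V1 + V2) - a
    half_gap = 0.5 * (V1 - V2)
    return mean + math.sqrt(half_gap ** 2 + a ** 2)

def sup_over_q(V1, V2, a=1.0, grid=200001):
    """Grid approximation of sup_q [V1 q + V2 (1 - q) - H_a(q)] over 0 <= q <= 1."""
    best = -math.inf
    for i in range(grid):
        q = i / (grid - 1)
        best = max(best, V1 * q + V2 * (1.0 - q) - dirichlet_rate(q, a))
    return best

if __name__ == "__main__":
    print(principal_eigenvalue(1.0, -0.5), sup_over_q(1.0, -0.5))
```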



4. Small random perturbations. The exit problem. The context for large 
deviations is a probability distribution that is nearly degenerate. So far, this 
has come from the law of large numbers or the ergodic theorem. But it can 
also come from small random perturbations of a deterministic system. 

Consider, for instance, the Brownian motion or Wiener measure on the
space of continuous functions $C_0[[0,T]; R]$ that are zero at 0, that is,
$x(0) = 0$. We can make the variance of the Brownian motion at time $t$ equal
to $\epsilon t$ instead of $t$. We can consider
$x_\epsilon(t) = \sqrt{\epsilon}\,x(t)$. Perhaps consider
$\epsilon = \frac{1}{n}$ and $x_n(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$, the
average of $n$ independent Brownian motions. In any case, the measure
$P_\epsilon$ of $x_\epsilon$ is nearly degenerate at the path
$f(\cdot) \equiv 0$.

We have a large deviation principle for $P_\epsilon$ with a rate function

$$H(f) = \frac12\int_0^T [f'(t)]^2\, dt.$$

There are various ways of seeing this. We will exhibit two. As a Gaussian
process, the generating function for Brownian motion is

$$\log E\left[\exp\left[\int_0^T x(t)g(t)\, dt\right]\right] = \frac12\int_0^T\!\!\int_0^T \min(s,t)\,g(s)g(t)\, ds\, dt.$$

Its dual is given by

$$\sup_{g(\cdot)}\left[\int_0^T f(t)g(t)\, dt - \frac12\int_0^T\!\!\int_0^T \min(s,t)\,g(s)g(t)\, ds\, dt\right] = \frac12\int_0^T [f'(t)]^2\, dt.$$
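A one-dimensional consequence can be checked exactly. Among paths with $f(0)=0$ and $f(T)=a$, the straight line minimizes $\frac12\int_0^T[f'(t)]^2\,dt$, with value $a^2/(2T)$, so one expects $\epsilon\log P[\sqrt{\epsilon}\,x(T)\ge a]\to -a^2/(2T)$. The Python sketch below is an illustration, not part of the original text, and its function names are hypothetical. It uses the fact that $\sqrt{\epsilon}\,x(T)$ is Gaussian with variance $\epsilon T$.

```python
import math

def scaled_log_tail(eps, a, T=1.0):
    """eps * log P[sqrt(eps) * x(T) >= a] for standard Brownian motion x.

    sqrt(eps) * x(T) is N(0, eps * T), so the tail is 0.5 * erfc(a / sqrt(2 eps T)).
    """
    return eps * math.log(0.5 * math.erfc(a / math.sqrt(2.0 * eps * T)))

def straight_line_cost(a, T=1.0):
    """H(f) = (1/2) * integral of f'(t)^2 for f(t) = a t / T, namely a^2 / (2 T)."""
    return 0.5 * a * a / T

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.002):
        # The first column approaches minus the second as eps decreases.
        print(eps, scaled_log_tail(eps, 1.0), straight_line_cost(1.0))
```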



Or, we can perturb $\sqrt{\epsilon}\,x(t)$ by $f(t)$ and change the measure
from $P_\epsilon$ to $Q_\epsilon$ that concentrates near $f$ rather than at
0. Of course, there are many measures that do this. $Q_\epsilon$ is just one
such. The relative entropy is easily calculated for our choice. It is

$$H(Q_\epsilon, P_\epsilon) = \frac{1}{2\epsilon}\int_0^T [f'(t)]^2\, dt.$$

This is also a lower bound for the rate function $H(f)$. In general, the
rate function has a lower bound which is the "entropy cost" of changing
the measure to do what we want it to do, namely, to concentrate on the
"wrong spot." The cheapest way of achieving this is invariably the actual
rate function. Let us now start with an ODE in $R^n$,

$$dx(t) = b(x(t))\, dt, \qquad x(0) = x_0,$$

and perturb it with a small noise,

$$dx_\epsilon(t) = b(x_\epsilon(t))\, dt + \sqrt{\epsilon}\, d\beta(t), \qquad x_\epsilon(0) = x_0,$$
where $\beta(t)$ is the Brownian motion in $R^n$. As $\epsilon \to 0$, the
distribution $P_\epsilon$ of $x_\epsilon(\cdot)$ will concentrate on the
unique solution $x_0(\cdot)$ of the ODE. But there will be a large deviation
principle with a rate function

$$H(f) = \frac12\int_0^T \|f'(t) - b(f(t))\|^2\, dt,$$

the cost of $x_\epsilon(\cdot)$ staying close to $f(\cdot)$, which will happen
if $\sqrt{\epsilon}\,\beta(\cdot)$ concentrates around $g(\cdot)$ with
$g'(\cdot) = f'(\cdot) - b(f(\cdot))$.

One application of this is to the "exit problem" and the resulting
interpretation of "punctuated equilibria." Let us take the ODE to be a
gradient flow:

$$dx(t) = -(\nabla V)(x(t))\, dt.$$

The system will move toward a minimum of $V$. If $V$ has multiple local
minima, or valleys surrounded by mountains, the solutions of the ODE
could be trapped near a local minimum, depending on the starting point.
They will not move from one local minimum to a deeper minimum. On the
other hand, with even a small noise, they develop "wanderlust." While, most
of the time, they follow the path dictated by the ODE, they will, from time
to time, deviate sufficiently to be able to find and reach a lower minimum.
If one were to sit at a deeper minimum and wait for the path to arrive, then
trace its history back, one would be apt to find that it did not wander at
all, but made the most efficient beeline from the previous minimum to this
one, as if it were guided by a higher power. The large deviation
interpretation of this curious phenomenon is simply that the system will
experiment with every conceivable path, with probabilities that are
extremely small. The higher the rate function, that is, the less efficient
the path, the smaller will be the probability and hence the less frequent
the attempts involving that path. Therefore, the first attempt to take the
path to its new location will be the most efficient path, one that climbs
the lowest mountain pass to get to the new fertile land!
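The cost of climbing out of a valley can be computed along one natural candidate path. With drift $b = -V'$ in one dimension, the path that retraces the gradient flow backward, $f' = V'(f)$, has cost $\frac12\int |f' + V'(f)|^2\,dt = 2\int V'(f)f'\,dt$, that is, twice the height climbed. The Python sketch below is an illustration only; the double-well potential $V(x) = x^4/4 - x^2/2$, the endpoints and the Euler discretization are assumptions made here and do not come from the original text.

```python
def potential(x):
    """Hypothetical double-well potential with minima at -1 and 1 and a pass at 0."""
    return 0.25 * x ** 4 - 0.5 * x ** 2

def potential_gradient(x):
    """V'(x) for the potential above."""
    return x ** 3 - x

def uphill_cost(start=-0.99, end=-0.01, dt=1e-4):
    """Cost (1/2) * integral |f' + V'(f)|^2 dt along the path solving f' = V'(f).

    The path is integrated by the explicit Euler scheme from start up to end.
    """
    x = start
    cost = 0.0
    while x < end:
        slope = potential_gradient(x)      # f'(t) = V'(f(t)) > 0 on (-1, 0)
        cost += 0.5 * (slope + slope) ** 2 * dt
        x += slope * dt
    return cost

if __name__ == "__main__":
    height = potential(-0.01) - potential(-0.99)
    print(uphill_cost(), 2.0 * height)
```

Following the flow downhill on the other side of the pass costs nothing, which is consistent with the picture of the lowest mountain pass determining the preferred route.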
5. Gibbs measures and statistical mechanics. Let us look at point
processes on $R$. The simplest is the Poisson point process. One can imagine
it as starting from a finite interval $[-\frac{n}{2}, \frac{n}{2}]$ and
placing $k_n = \rho n$ particles there randomly, independently and
uniformly. Their joint density is

$$\frac{1}{n^{k_n}}\, dx_1 \cdots dx_{k_n}.$$

We can ignore the labeling and think of it as a point process $P_n$. The
density is then

$$\frac{k_n!}{n^{k_n}}\, dx_1 \cdots dx_{k_n}.$$

As $n \to \infty$, we obtain a point process $P$ which is the Poisson point
process with intensity $\rho$. We can try to modify $P_n$ into $Q_n$ given by

$$dQ_n = \frac{1}{Z_n}\exp\left[-\sum_{i,j} V(x_i - x_j)\right] dP_n,$$

where $V \ge 0$ has compact support. Here, $Z_n$ is the normalizing constant

$$Z_n = \int \exp\left[-\sum_{i,j} V(x_i - x_j)\right] dP_n.$$

The limit

$$\Psi(V) = \lim_{n\to\infty}\frac{1}{n}\log Z_n$$

is called the free energy. If $Q$ is any stationary, that is,
translation-invariant point process on $R$, its restriction to $[a,b]$ is a
symmetric (i.e., permutation-invariant) probability distribution
$q_{[a,b]}$ on

$$\bigcup_{k\ge 0} [a,b]^k.$$

Similarly, the Poisson point process with intensity $\rho$ generates
$p_{[a,b]}$, a convex combination of uniform distributions on $[a,b]^k$ with
Poisson weights

$$e^{-\rho(b-a)}\,\frac{\rho^k (b-a)^k}{k!}.$$

The relative entropy $H(q_{[a,b]}, p_{[a,b]}) = H_\ell(Q,P)$ depends only on
the length $\ell = (b-a)$ and is superadditive, that is,

$$H_{\ell_1+\ell_2}(Q,P) \ge H_{\ell_1}(Q,P) + H_{\ell_2}(Q,P).$$

Therefore, the limit

$$\lim_{\ell\to\infty}\frac{1}{\ell}H_\ell(Q,P) = \sup_{\ell}\frac{1}{\ell}H_\ell(Q,P) = H(Q,P)$$

exists and is called the specific relative entropy. There is the expectation

$$E^Q[V] = \lim_{n\to\infty}\frac{1}{n}E^Q\left[\sum_{x_\alpha, x_\beta \in [-n/2,\, n/2]} V(x_\alpha - x_\beta)\right].$$

Then,

$$\Psi(V) = -\inf_{Q}\,[E^Q[V] + H(Q,P)]$$

over all stationary point processes and the minimizer (unique) is the limit
of $Q_n$. Such considerations play a crucial role in the theory of
equilibrium statistical mechanics and thermodynamics [16].

6. Interacting particle systems. Interacting particle systems offer an
interesting area where methods of large deviation can be applied. Let us
look at some examples. Consider the lattice $Z_N$ of integers modulo $N$. We
think of them as points arranged uniformly on the circle of unit
circumference or the unit interval with endpoints identified. For each
$x \in Z_N$, there corresponds a point $\xi = \frac{x}{N}$ on the interval
(really, the circle). There are a certain number of particles in the system
that occupy some of the sites. The particles wait for an exponential time
and then decide to jump. They pick the adjacent site either to their left
or right with equal probability and then jump. In this model, the particles
do not interact and, after a diffusive rescaling of space and time, the
particles behave like independent Brownian motions. Since there is a large
number of particles, by an application of the law of large numbers, the
empirical density will be close to a solution $\rho(t,\xi)$ of the heat
equation

$$\rho_t = \frac12\rho_{\xi\xi}$$

with a suitable initial condition, depending on how the particles were
initially placed.

We can introduce an interaction in this model, by imposing a limit of at 
most one particle per site. When the particle decides to jump to a randomly 
chosen site after a random exponential waiting time, it can jump only when 
the site is unoccupied. Otherwise, it must wait for a new exponential time 
before trying to jump again. All of the particles wait and attempt to jump 
independently of each other. Since we are dealing with continuous time and
exponential distributions, two particles will not try to jump at the same
time and we do not have to resolve ties.

One can express all of this by simply writing down the generator. Here
$\eta$ is a configuration and $X_N = \{0,1\}^{Z_N}$ is the set of all
configurations; $\eta(x) = 1$ if the site $x$ is occupied and $\eta(x) = 0$
if the site $x$ is free. A point in $X_N$ is just a map
$\eta : Z_N \to \{0,1\}$. The generator $A_N$ acting on functions
$f : X_N \to R$ is given by

$$(A_N f)(\eta) = \frac12\sum_{x}\eta(x)\left[(1-\eta(x+1))(f(\eta^{x,x+1}) - f(\eta)) + (1-\eta(x-1))(f(\eta^{x,x-1}) - f(\eta))\right],$$

where $\eta^{x,y}$ is defined by

$$\eta^{x,y}(z) = \begin{cases} \eta(x), & \text{if } z = y,\\ \eta(y), & \text{if } z = x,\\ \eta(z), & \text{otherwise.} \end{cases}$$
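The generator is simple enough to implement directly. The Python sketch below is an illustration only; configurations are represented as tuples of 0s and 1s indexed by the sites of $Z_N$, and the function names are hypothetical. It applies $A_N$ to an arbitrary function of the configuration.

```python
def swap(eta, x, y):
    """Return eta^{x,y}: the configuration with the values at sites x and y exchanged."""
    new = list(eta)
    new[x], new[y] = eta[y], eta[x]
    return tuple(new)

def apply_generator(f, eta):
    """(A_N f)(eta) for the symmetric simple exclusion process on Z_N."""
    N = len(eta)
    total = 0.0
    for x in range(N):
        if eta[x] == 0:
            continue                      # only occupied sites attempt jumps
        for y in ((x + 1) % N, (x - 1) % N):
            if eta[y] == 0:               # a jump is allowed only into a free site
                total += 0.5 * (f(swap(eta, x, y)) - f(eta))
    return total

def particle_count(eta):
    """Total number of particles in the configuration."""
    return float(sum(eta))

if __name__ == "__main__":
    eta = (1, 0, 1, 1, 0, 0, 1, 0)
    print(apply_generator(particle_count, eta))            # conservation of particles
    print(apply_generator(lambda cfg: float(cfg[0]), eta))  # rate of change of eta(0)
```

Applying the generator to the particle count gives zero for every configuration, which is the conservation law used repeatedly in what follows.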

This system has some interesting features. The total number
$\sum_x \eta(x)$ of particles does not change over time. Particles just
diffuse over time. It will take time of order $N^2$ for the effect to be
felt at a distance of order $N$. The equilibrium distributions $\mu_{N,k}$
are uniform over all possible $\binom{N}{k}$ configurations, where $k$ is
the number of particles. In particular, there are multiple equilibria. Such
systems can be locally in equilibrium while approaching the global
equilibrium rather slowly. For example, we initially place our particles in
such a way that one half of the circle $C_1$ has particle density
$\frac14$, while the other half $C_2$ has density $\frac34$. The system will
locally stay near the two different equilibria at the two intervals for
quite some time, with just some smoothing near the edges. If we wait for
times of order $N^2$, that is, time $N^2 t$ with $t > 0$, the density will
be close to $\rho(t, \frac{x}{N})$. A calculation that involves applying the
speeded up generator $N^2 A_N$ to expressions of the form

$$\frac{1}{N}\sum_{x} J\left(\frac{x}{N}\right)\eta(x)$$

shows that $\rho(t,\xi)$ will be the solution of the heat equation

$$\rho_t(t,\xi) = \frac12\rho_{\xi\xi}(t,\xi); \qquad \rho(0,\xi) = \frac14\mathbf{1}_{C_1}(\xi) + \frac34\mathbf{1}_{C_2}(\xi).$$

In this case, this behavior is the same as if the particles moved
independently without any interaction. While this is true at the level of
the law of large numbers, we shall see that for large deviation rates, the
interaction does matter.

7. Superexponential estimates. Let us consider the process in
equilibrium, with $k = \rho N$ particles. We let $N, k$ go to $\infty$ while
$\rho$ is kept fixed. It is easy to see, for any fixed site $x$, that

$$\mu_{N,k}[\eta(x) = 1] \to \rho$$

and $\{\eta(x_j)\}$ become independent for any finite set of distinct sites.
If $f(\eta)$ is a local function and $f_x(\eta) = f(\tau_x\eta)$ its
translation, then

$$\frac{1}{N}\sum_{x} f_x(\eta) \to E^{P_\rho}[f(\eta)]$$

as $N \to \infty$. The probabilities of deviations decay exponentially fast.
The exponential rate is given by

$$H(a) = \inf_{Q \in \mathcal{M}(a,\rho)} H(Q, P_\rho),$$

where $\mathcal{M}(a,\rho)$ consists of stationary measures $Q$ with density
$\rho$ and $E^Q[f(\eta)] = a$, and $P_\rho$ is Bernoulli with density
$\rho$. $H(Q,P)$ is the specific entropy, calculated as the limit

$$\lim_{n\to\infty}\frac{1}{n}H(Q_n, P_n),$$

where $Q_n$ and $P_n$ are restrictions of $Q$ and $P$ to a block of $n$
sites. This is, of course, an equilibrium calculation done at a fixed time.
What about space-time averages

$$\frac{1}{NT}\int_0^T\sum_{x} f_x(\eta(t))\, dt?$$

The rate function is now some $H_T(a)$ that depends on $T$. While this is
hard to calculate, one can show that

$$\lim_{T\to\infty} H_T(a) = \inf_{\rho(\cdot)\in\mathcal{P}(a,\rho)}\int H(P_{\rho(\xi)}, P_\rho)\, d\xi,$$

where $\mathcal{P}(a,\rho)$ consists of density profiles that satisfy

$$\int\rho(\xi)\, d\xi = \rho \quad\text{and}\quad \int E^{P_{\rho(\xi)}}[f(\eta)]\, d\xi = a.$$

Note that the rate increases to a finite limit as $T \to \infty$. The reason
is that large-scale fluctuations in density can occur with exponentially
small probability and these fluctuations do not necessarily decay when $T$
is large. Particles diffuse slowly and it takes times of order $N^2$ to
diffuse through order $N$ sites.

To resolve this difficulty, we define the approximate large-scale empirical
density by

$$r_{N,\epsilon}(\xi,\eta) = \frac{1}{2N\epsilon}\sum_{x :\, |x/N - \xi| \le \epsilon}\eta(x)$$

and compare

$$\frac{1}{N^3}\int_0^{N^2 T}\sum_{x} f_x(\eta(t))\, dt = Y_{N,f}(T)$$

with

$$\frac{1}{N^3}\int_0^{N^2 T}\sum_{x} \hat f\left(r_{N,\epsilon}\left(\frac{x}{N},\eta(t)\right)\right) dt = Y_{N,f,\epsilon}(T),$$

where

$$\hat f(\rho) = E^{P_\rho}[f(\eta)].$$

One can show that

$$\limsup_{\epsilon\to 0}\,\limsup_{N\to\infty}\frac{1}{N}\log P[|Y_{N,f}(T) - Y_{N,f,\epsilon}(T)| \ge \delta] = -\infty$$

for any $\delta > 0$. In other words, in time scale $N^2$, the large scale
density fluctuations are responsible for all large deviations with a
logarithmic decay rate of $N$. Such estimates, called superexponential
estimates, play a crucial role in the study of interacting particle systems
where approach to equilibrium is slow in large scales.



8. Hydrodynamical limits. If we consider a large system with multiple
equilibria, the system can remain in a local equilibrium for a long time.
Let the different equilibria be labeled by a parameter $\rho$. If the system
lives on a large spatial domain and if $\xi$ denotes a point in the
macroscopic scale of space, a function $\rho(\xi)$ could describe a system
that is locally in equilibrium, albeit at different ones at macroscopically
different points. If $t$ is time measured in a suitably chosen faster scale,
then $\rho(\xi) = \rho(t,\xi)$ may evolve gently in time and converge as
$t \to \infty$ to a constant $\rho$ that identifies the global equilibrium.

For instance, in our example, let us initially start with $k = N\rho$
particles and distribute them in such a way that we achieve a density
profile $\rho_0(\xi)$ as $N \to \infty$. Technically, this means

$$\lim_{N\to\infty}\frac{1}{N}\sum_{x} J\left(\frac{x}{N}\right)\eta(x) = \int J(\xi)\rho_0(\xi)\, d\xi$$

for bounded continuous test functions $J$. What will we see after time
$tN^2$, especially when $N$ is large?

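The answer is very easy. Consider the sum

$$f(\eta) = \frac{1}{N}\sum_{x} J\left(\frac{x}{N}\right)\eta(x)$$

and apply the speeded up operator $N^2 A_N$ to it. Since

$$f(\eta^{x,y}) - f(\eta) = \frac{1}{N}\left[J\left(\frac{y}{N}\right) - J\left(\frac{x}{N}\right)\right][\eta(x) - \eta(y)],$$

we obtain

$$(N^2 A_N f)(\eta) = \frac{N}{2}\sum_{x}\eta(x)(1-\eta(x+1))\left[J\left(\frac{x+1}{N}\right) - J\left(\frac{x}{N}\right)\right] + \frac{N}{2}\sum_{x}\eta(x)(1-\eta(x-1))\left[J\left(\frac{x-1}{N}\right) - J\left(\frac{x}{N}\right)\right]$$

$$= \frac{N}{2}\sum_{x}\left[\eta(x)(1-\eta(x+1)) - \eta(x+1)(1-\eta(x))\right]\left[J\left(\frac{x+1}{N}\right) - J\left(\frac{x}{N}\right)\right]$$

$$= \frac{N}{2}\sum_{x}\left[\eta(x) - \eta(x+1)\right]\left[J\left(\frac{x+1}{N}\right) - J\left(\frac{x}{N}\right)\right]$$

$$= \frac{N}{2}\sum_{x}\left[J\left(\frac{x+1}{N}\right) + J\left(\frac{x-1}{N}\right) - 2J\left(\frac{x}{N}\right)\right]\eta(x)$$

$$\simeq \frac{1}{2N}\sum_{x} J''\left(\frac{x}{N}\right)\eta(x),$$

leading to the limiting heat equation

$$\rho_t = \frac12\rho_{\xi\xi}, \qquad \rho(0,\xi) = \rho_0(\xi).$$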
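The cancellation of the nonlinear terms in this computation is exact, not just asymptotic, and can be confirmed by brute force. The Python sketch below is an illustration only; the test function $J(\xi) = \sin(2\pi\xi)$, the sample configuration and the function names are assumptions made here. It applies $N^2 A_N$ to the linear statistic and compares the result with the discrete Laplacian expression.

```python
import math

def swap(eta, x, y):
    """Return the configuration with the values at sites x and y exchanged."""
    new = list(eta)
    new[x], new[y] = eta[y], eta[x]
    return tuple(new)

def apply_generator(f, eta):
    """(A_N f)(eta) for the symmetric simple exclusion process on Z_N."""
    N = len(eta)
    total = 0.0
    for x in range(N):
        if eta[x] == 0:
            continue
        for y in ((x + 1) % N, (x - 1) % N):
            if eta[y] == 0:
                total += 0.5 * (f(swap(eta, x, y)) - f(eta))
    return total

def linear_statistic(J, eta):
    """f(eta) = (1/N) * sum_x J(x/N) eta(x)."""
    N = len(eta)
    return sum(J(x / N) * eta[x] for x in range(N)) / N

def speeded_up_generator(J, eta):
    """(N^2 A_N f)(eta) for the linear statistic f built from J."""
    N = len(eta)
    return N * N * apply_generator(lambda cfg: linear_statistic(J, cfg), eta)

def discrete_laplacian_form(J, eta):
    """(N/2) * sum_x eta(x) [J((x+1)/N) - 2 J(x/N) + J((x-1)/N)], indices mod N."""
    N = len(eta)
    return 0.5 * N * sum(
        eta[x] * (J(((x + 1) % N) / N) - 2.0 * J(x / N) + J(((x - 1) % N) / N))
        for x in range(N))

if __name__ == "__main__":
    J = lambda xi: math.sin(2.0 * math.pi * xi)
    eta = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1)
    print(speeded_up_generator(J, eta), discrete_laplacian_form(J, eta))
```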



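Let us change the problem slightly. Introduce a slight bias. The
probabilities to right and left are $\frac12 \pm \frac{b}{2N}$, where
$b > 0$ is the bias to the right. This introduces an extra term, so that now

$$(N^2 A_N f)(\eta) \simeq \frac{1}{2N}\sum_{x} J''\left(\frac{x}{N}\right)\eta(x) + F_N(\eta),$$

where

$$F_N(\eta) = \frac{b}{2N}\sum_{x} J'\left(\frac{x}{N}\right)\left[\eta(x)(1-\eta(x+1)) + \eta(x)(1-\eta(x-1))\right].$$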



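This is a problem because $F_N$ is nonlinear in $\eta$ and cannot be
replaced by a simple expression involving $\rho$. If we were locally in
equilibrium, $F_N$ would be replaced by

$$G_{N,\epsilon}(\eta) = \frac{b}{N}\sum_{x} J'\left(\frac{x}{N}\right) r_{N,\epsilon}\left(\frac{x}{N},\eta\right)\left(1 - r_{N,\epsilon}\left(\frac{x}{N},\eta\right)\right),$$

where $r_{N,\epsilon}(\frac{x}{N},\eta)$ is the empirical density. If we were
in a global equilibrium, in the faster time scale,

$$\left|\int_0^T \left[F_N(\eta(t)) - G_{N,\epsilon}(\eta(t))\right] dt\right| \ge \delta,$$

with superexponentially small probability. The importance of
superexponential estimates lies in the following elementary, but universal,
inequality:

$$Q(A) \le \frac{2 + H(Q,P)}{\log(1/P(A))}.$$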

If $P(A)$ is superexponentially small and $H(Q,P)$ is linear in $N$, then
$Q(A)$ is small. Perturbation of the process by a slight bias of order
$\frac{1}{N}$ produces a relative entropy of order $\frac{1}{N^2}$ per site
per unit time. With $N$ sites and time of order $N^2$, this is still only
linear in $N$. Changes in initial conditions can also only contribute linear
relative entropy. So, we can carry out the approximation and end up with

$$\rho_t = \frac12\rho_{\xi\xi} - [b\rho(1-\rho)]_\xi.$$

Note that without exclusion, the limit would have been

$$\rho_t = \frac12\rho_{\xi\xi} - [b\rho]_\xi,$$

which is just the CLT when there is a small mean.
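The entropy inequality $Q(A) \le (2 + H(Q,P))/\log(1/P(A))$ used above can be probed numerically. For a single event $A$ with $P(A) = p$ and $Q(A) = q$, the relative entropy $H(Q,P)$ is at least the two-point relative entropy of Bernoulli$(q)$ with respect to Bernoulli$(p)$, so it suffices to test the bound with that smallest possible value. The Python sketch below is an illustration only; the function names and the sample growth rates are assumptions made here.

```python
import math

def two_point_entropy(q, p):
    """Relative entropy of Bernoulli(q) with respect to Bernoulli(p), 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def entropy_bound(H, log_inv_p):
    """Right-hand side (2 + H) / log(1 / P(A)), with log(1 / P(A)) passed directly."""
    return (2.0 + H) / log_inv_p

if __name__ == "__main__":
    # Entropy linear in N against a superexponentially small P(A) = exp(-N^2):
    for N in (10, 100, 1000):
        print(N, entropy_bound(3.0 * N, float(N) ** 2))
```

Passing $\log(1/P(A))$ rather than $P(A)$ itself avoids floating-point underflow when $P(A)$ is superexponentially small.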

9. Large deviations in hydrodynamical limits. We saw earlier that in our
model, if we started with an initial profile with density $\rho_0$, in a
speeded up time scale, the system will evolve with a time-dependent profile
which is the solution of the heat equation

$$\rho_t = \frac12\rho_{\xi\xi}.$$

This is, of course, true with probability nearly 1. We can make the initial
condition deterministic in order that no initial deviation is possible.
Still, our random evolution could produce, with small probability, strange
behavior. If we limit ourselves to deviations with only exponentially small
probabilities, what deviations are possible and what are their exponential
rates?

We can doctor the system with a bias $b$ which is not constant, but is a
function $b(t, \frac{x}{N})$ of $x$ and fast time $t$ with $b(t,\xi)$ being a
nice function of $t$ and $\xi$. This will produce a solution of

$$\rho_t(t,\xi) = \frac12 D_\xi^2\rho(t,\xi) - D_\xi[b(t,\xi)\rho(t,\xi)(1-\rho(t,\xi))], \qquad \rho(0,\xi) = \rho_0(\xi).$$

This can be done with an entropy cost that can be calculated as

$$\Phi(b) = \frac12\int_0^T\!\!\int |b(t,\xi)|^2\rho(t,\xi)(1-\rho(t,\xi))\, dt\, d\xi,$$

for the duration $[0,T]$. If $\rho(t,\xi)$ is given, then we can minimize
$\Phi(b)$ over all compatible $b$. This is a lower bound for the rate. One
can match this bound in the other direction.

10. Large deviations for random walks in random environments. We
will start with a probability space $(\Omega, \Sigma, P)$ on which $Z^d$
acts ergodically as a family $\tau_z$ of measure-preserving
transformations. We are given $\pi(\omega,z)$, which is a probability
distribution on $Z^d$ for each $\omega$, and is a measurable function of
$\omega$ for each $z$. One can then generate random transition
probabilities $\pi(\omega,z',z)$ by defining

$$\pi(\omega, z', z'+z) = \pi(\tau_{z'}\omega, z).$$

For each $\omega$, $\pi(\omega,z',z)$ can serve as the transition
probability of a Markov process on $Z^d$, and the measure corresponding to
this process, starting from 0, is denoted by $Q^\omega$. This is, of course,
random (it depends on $\omega$) and is called the random walk in the random
environment $\omega$. One can ask the usual questions about this random
walk and, in some form, they may be true for almost all $\omega$ with
respect to $P$. The law of large numbers, if valid, will take the form

$$P\left[\omega : \lim_{n\to\infty}\frac{S_n}{n} = m(P)\ \text{a.e. } Q^\omega\right] = 1.$$

Such statements concerning the almost sure behavior under $Q^\omega$ for
almost all $\omega$ with respect to $P$ are said to deal with the "quenched"
version. Sometimes one wishes to study the behavior of the "averaged" or
"annealed" measure,

$$\bar Q = \int Q^\omega P(d\omega).$$

The law of large numbers is the same because it is equivalent to

$$\bar Q\left[\lim_{n\to\infty}\frac{S_n}{n} = m(P)\right] = 1.$$

On the other hand, questions on the asymptotic behavior of probabilities, 
like the central limit theorem or large deviations, could be different for the 
quenched and the averaged cases. 

A special environment, which is called the product environment, is one in
which $\pi(\omega, z', z'+z)$ are independent for different $z'$ and have a
common distribution $\beta$ which is a probability measure on the space
$\mathcal{M}$ of all probability measures on $Z^d$. In this case, the
canonical choice for $(\Omega, \Sigma, P)$ is the countable product of
$\mathcal{M}$ and the product measure $P$ with marginals $\beta$.

There are large deviation results regarding the limits

$$\lim_{n\to\infty}\frac{1}{n}\log Q^\omega\left[\frac{S_n}{n}\simeq a\right] = -I(a)$$

and

$$\lim_{n\to\infty}\frac{1}{n}\log\bar Q\left[\frac{S_n}{n}\simeq a\right] = -\bar I(a).$$

The difference between $I$ and $\bar I$ has a natural explanation. In the special
case of $Z$, in terms of the large deviation behavior in the one-dimensional
random environment, it is related to the following question. If, at time
$n$, we see a particle reach an improbable value $na$, what does it say
about the environment in $[0, na]$? Did the particle behave strangely in the
environment or did it encounter a strange environment? It is probably a
combination of both.

Large deviation results exist, however, in much wider generality, both
in the quenched and the averaged cases. The large deviation principle is
essentially the existence of the limits

$$\lim_{n\to\infty}\frac{1}{n}\log E[\exp[\langle\theta, S_n\rangle]] = \Psi(\theta).$$

The expectation is with respect to $Q^\omega$ or $\bar Q$, which could
produce different limits for $\Psi$. The law of large numbers and central
limit theorem involve the differentiability of $\Psi$ at $\theta = 0$, which
is harder. In fact, large deviation results have been proven by general
subadditivity arguments for the quenched case. Roughly speaking, fixing
$\omega$, we have

$$Q^\omega[S_{k+\ell}\simeq(k+\ell)a] \ge Q^\omega[S_k\simeq ka]\times Q^{\tau_{ka}\omega}[S_\ell\simeq\ell a].$$

It is harder for the averaged case. We want to prove that the limits

$$\lim_{n\to\infty}\frac{1}{n}\log\bar Q\left[\frac{S_n}{n}\simeq a\right] = -\bar I(a)$$

or, equivalently,

$$\lim_{n\to\infty}\frac{1}{n}\log E^{\bar Q}[\exp[\langle\theta, S_n\rangle]] = \bar\Psi(\theta),$$

exist. The problem is that the measure $\bar Q$ is not very nice. As the
random walk explores $Z^d$, it learns about the environment and in the case
of the product environment, when it returns to a site that it has visited
before, the experience has not been forgotten and leads to long term
correlations. However, if we are interested in the behavior
$S_n \simeq na$ with $a \ne 0$, the same site is not visited too often and
the correlations should rapidly decay.
One can use Bayes' rule to calculate the conditional distribution

$$\bar Q[S_{n+1} = S_n + z \mid S_1, S_2, \ldots, S_n] = q(z \mid w),$$

where $w$ is the past history of the walk. Before we do that, it is more
convenient to shift the origin as we go along so that the current position
of the random walk is always the origin and the current time is always 0. An
$n$-step walk then looks like $w = \{S_0 = 0, S_{-1}, \ldots, S_{-n}\}$. We
pick a $z$ with probability $q(z \mid w)$. We obtain a new walk of $n+1$
steps $w' = \{S'_0 = 0, S'_{-1}, \ldots, S'_{-(n+1)}\}$ given by
$S'_{-(k+1)} = S_{-k} - z$ for $k \ge 0$. We can now calculate
$q(z \mid w)$. We need to know all the quantities $\{k(w,x,z)\}$, the numbers
of times the walk has visited $x$ in the past and jumped from $x$ to $x+z$.
It is not hard to see that the a posteriori probability can be calculated as

$$q(z \mid w) = \frac{\int \pi(z)\prod_{z'}\pi(z')^{k(w,0,z')}\,\beta(d\pi)}{\int \prod_{z'}\pi(z')^{k(w,0,z')}\,\beta(d\pi)}.$$

While this initially makes sense only for walks of finite length, it can
clearly be extended to all transient paths. Note that although we only use
$k(w,0,z')$, in order to obtain the new $k(w',0,z')$, we would need to know
the collection $\{k(w,z,z')\}$.
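When the law $\beta$ of the environment at a single site has finite support, the a posteriori probability reduces to a ratio of finite sums and can be computed directly. The Python sketch below is an illustration only; the representation of $\beta$ as a list of weighted jump distributions, the nearest-neighbor example and the function name are assumptions made here.

```python
def posterior_jump_probability(z, counts, beta):
    """q(z | w) for a product environment whose law beta has finite support.

    beta   : list of (weight, pi) pairs, where pi maps each jump z' to pi(z').
    counts : maps each jump z' to k(w, 0, z'), the number of earlier jumps
             by z' made from the current site.
    """
    numerator = 0.0
    denominator = 0.0
    for weight, pi in beta:
        likelihood = weight
        for jump, k in counts.items():
            likelihood *= pi[jump] ** k
        numerator += pi[z] * likelihood
        denominator += likelihood
    return numerator / denominator

if __name__ == "__main__":
    # Two possible site environments on Z, each with prior weight 1/2.
    beta = [(0.5, {1: 0.8, -1: 0.2}), (0.5, {1: 0.3, -1: 0.7})]
    print(posterior_jump_probability(1, {1: 0, -1: 0}, beta))  # first visit
    print(posterior_jump_probability(1, {1: 2, -1: 0}, beta))  # after two right jumps
```

Two earlier jumps to the right from the current site make the right-biased environment more likely, so the conditional probability of another right jump increases. This is the memory effect responsible for the long term correlations under the averaged measure.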

Now, suppose that $R$ is a process with stationary increments $\{z_j\}$.
Then, we can again make the current position the origin and if the process
is transient, as it would be if the increments were ergodic and had a
nonzero mean $a$, the conditional probabilities $q(z \mid w)$ would exist
a.e. $R$ and can be compared to the corresponding conditional probabilities
$r(z \mid w)$ under $R$. The relative entropy

$$H(R) = E^R\left[\sum_{z} r(z \mid w)\log\frac{r(z \mid w)}{q(z \mid w)}\right]$$

is then well defined. The function

$$\bar I(a) = \inf_{\substack{R :\, \int z_1\, dR = a \\ R\ \text{ergodic}}} H(R),$$

defined for $a \ne 0$, extends as a convex function to all of $R^d$ and,
with this $\bar I$ as rate function, a large deviation result holds.

We now turn to the quenched case. Although a proof using the subadditive
ergodic theorem exists, we will provide an alternate approach that is more
appealing. We will illustrate this in the context of Brownian motion with a
random drift. Instead of the action of $Z^d$, we can have $R^d$ acting on
$(\Omega, \Sigma, P)$ ergodically and consider a diffusion on $R^d$ with a
random infinitesimal generator

$$(\mathcal{L}^\omega u)(x) = \frac12(\Delta u)(x) + \langle b(\omega,x), (\nabla u)(x)\rangle$$

acting on smooth functions on $R^d$. Here, $b(\omega,x)$ is generated from a
map $b(\omega) : \Omega \to R^d$ by the action of
$\{\tau_x : x \in R^d\}$ on $\omega$:

$$b(\omega,x) = b(\tau_x\omega).$$

Again, there is the quenched measure $Q^\omega$ that corresponds to the
diffusion with generator $\mathcal{L}^\omega$ that starts from 0 at time 0,
and the averaged measure that is given by a similar formula. This model is
referred to as diffusion with a random drift. Exactly the same questions can
be asked in this context. We can define a diffusion on $\Omega$ with
generator

$$\mathcal{L} = \frac12\Delta + \langle b(\omega), \nabla\rangle,$$

where $\nabla = \{D_i\}$ are the generators of the translation group
$\{\tau_x : x \in R^d\}$. This is essentially the image of lifting the paths
$x(t)$ of the diffusion on $R^d$ corresponding to $\mathcal{L}^\omega$ to
$\Omega$ by

U(t) = T x{ t)U). 
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While there is no possibility of having an invariant probability measure on 
R rf , on $7, one can hope to find an invariant probability density <j)(uj), that 
is, to find 4>(u>) > in L\{P) with J <j)dP = 1 which solves 

%A<t> = V- (b<p). 

If such a <f> exists, then we have an ergodic theorem for the diffusion process 
Q u corresponding to C on f2, 



lim 

t-*oo t Jo 



f(u( 8 ))d8= / f^^dP 



a.e. P. 



This also translates to an ergodic theorem on R d . If we define the stationary 
process g by g(u>,x) = f(r x u), then 

lim - f* ' f(u,x(s))ds= [ f(u;U(uj)dP a.e. , a.e. P, 

t~*co t Jo J 

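where $Q^\omega$ is now the quenched process in the random environment. Since

$$x(t) = \int_0^t b(\omega, x(s)) \, ds + \beta(t),$$

where $\beta(\cdot)$ is the Brownian motion, it is clear that

$$\lim_{t \to \infty} \frac{x(t)}{t} = \int b(\omega) \phi(\omega) \, dP \quad \text{a.e. } Q^\omega, \text{ a.e. } P,$$

providing a law of large numbers for $x(t)$. While we cannot be sure of finding
$\phi$ for a given $b$, it is easy to find a $b$ for a given $\phi$. For
instance, if $\phi > 0$, we could take $b = \frac{\nabla \phi}{2\phi}$. Or, more
generally, $b = \frac{\nabla \phi}{2\phi} + \frac{c}{\phi}$ with
$\nabla \cdot c = 0$. If we change $b$ to $b'$ which satisfies
$\tfrac{1}{2} \Delta \phi = \nabla \cdot (b' \phi)$, the new process
$Q^{b',\omega}$ with drift $b'$ will, in the time interval $[0, t]$, have
relative entropy

$$E^{Q^{b',\omega}}\left[\frac{1}{2} \int_0^t \|b(\omega(s)) - b'(\omega(s))\|^2 \, ds\right]$$

and, by the ergodic theorem, one can see that, a.e. with respect to $P$,

$$\lim_{t \to \infty} \frac{1}{t} E^{Q^{b',\omega}}\left[\frac{1}{2} \int_0^t \|b(\omega(s)) - b'(\omega(s))\|^2 \, ds\right] = \frac{1}{2} \int \|b(\omega) - b'(\omega)\|^2 \phi(\omega) \, dP.$$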
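
The recipe for building a drift from a prescribed density can be checked numerically. The sketch below works under assumptions that are not in the text: the environment is one-dimensional and periodic ($\Omega$ is the unit circle with Lebesgue measure, on which translations act ergodically), $\phi(x) = 1 + \tfrac{1}{2}\cos(2\pi x)$, and $c$ is a constant, which is the only divergence-free choice in one dimension. It verifies that $b = \phi'/(2\phi) + c/\phi$ satisfies $\tfrac{1}{2}\phi'' = (b\phi)'$ up to discretization error and evaluates the velocity $\int b \phi \, dP$ from the law of large numbers.

```python
import math

N = 400          # grid points on the unit circle (period 1)
h = 1.0 / N
c = 0.7          # constant, hence divergence-free in one dimension


def phi(x):
    """Assumed invariant density: positive, period 1, integral 1."""
    return 1.0 + 0.5 * math.cos(2.0 * math.pi * x)


def dphi(x):
    return -math.pi * math.sin(2.0 * math.pi * x)


def d2phi(x):
    return -2.0 * math.pi ** 2 * math.cos(2.0 * math.pi * x)


def drift(x):
    """b = phi' / (2 phi) + c / phi."""
    return dphi(x) / (2.0 * phi(x)) + c / phi(x)


def flux(x):
    """b * phi, whose derivative should equal phi'' / 2."""
    return drift(x) * phi(x)


xs = [i * h for i in range(N)]

# Residual of (1/2) phi'' = (b phi)', with a centred difference for (b phi)'.
residual = max(
    abs(0.5 * d2phi(x) - (flux(x + h) - flux(x - h)) / (2.0 * h)) for x in xs
)

# Velocity predicted by the law of large numbers: the integral of b * phi.
velocity = sum(flux(x) for x in xs) * h
```

Here `residual` is of the order of the squared grid spacing and `velocity` agrees with `c` to rounding error, since the gradient part $\phi'/2$ of the flux integrates to zero over a period. Only the divergence-free part $c$ contributes to the mean displacement.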



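Moreover, for almost all $\omega$ with respect to $P$, almost surely with
respect to $Q^{b',\omega}$,

$$\lim_{t \to \infty} \frac{x(t)}{t} = \int b'(\omega) \phi(\omega) \, dP.$$

If we fix $\int b'(\omega) \phi(\omega) \, dP = a$, the bound

$$\liminf_{t \to \infty} \frac{1}{t} \log Q^\omega\left[\frac{x(t)}{t} \simeq a\right] \ge -\frac{1}{2} \int \|b - b'\|^2 \phi \, dP$$

is easily obtained. If we define

$$I(a) = \inf_{\substack{b', \phi \\ \frac{1}{2} \Delta \phi = \nabla \cdot (b' \phi) \\ \int b' \phi \, dP = a}} \left[\frac{1}{2} \int \|b - b'\|^2 \phi \, dP\right],$$

then

$$\liminf_{t \to \infty} \frac{1}{t} \log Q^\omega\left[\frac{x(t)}{t} \simeq a\right] \ge -I(a).$$

Of course, these statements are valid a.e. $P$. One can check that $I$ is
convex and that the upper bound amounts to proving the dual estimate

$$\lim_{t \to \infty} \frac{1}{t} \log E^{Q^\omega}\left[e^{\langle \theta, x(t) \rangle}\right] \le \Psi(\theta),$$

where

$$\Psi(\theta) = \sup_a \left[\langle a, \theta \rangle - I(a)\right].$$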
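
The duality between the rate function and $\Psi$ can be illustrated in the simplest case. The sketch below assumes a non-random, constant drift `b0` in one dimension (an assumption made only for illustration), for which $x(t) = b_0 t + \beta(t)$, so that $I(a) = \tfrac{1}{2}(a - b_0)^2$ and $\tfrac{1}{t}\log E[e^{\theta x(t)}] = b_0\theta + \tfrac{1}{2}\theta^2$ exactly. The function names and grid parameters are hypothetical.

```python
def rate(a, b0):
    """I(a) for Brownian motion with constant drift b0 (assumed test case)."""
    return 0.5 * (a - b0) ** 2


def legendre(theta, b0, lo=-10.0, hi=10.0, n=20001):
    """Approximate Psi(theta) = sup_a [a * theta - I(a)] on a uniform grid."""
    step = (hi - lo) / (n - 1)
    return max(
        (lo + i * step) * theta - rate(lo + i * step, b0) for i in range(n)
    )


b0 = 0.5
theta = 1.0
psi_numeric = legendre(theta, b0)

# Exact value of (1/t) log E[exp(theta * x(t))] for x(t) = b0 * t + beta(t).
psi_exact = b0 * theta + 0.5 * theta ** 2
```

The grid supremum is attained near $a = b_0 + \theta$, and `psi_numeric` agrees with `psi_exact` up to the grid resolution. In a genuinely random environment neither side is explicit, which is why the upper bound requires the subsolution argument.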



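We need a bound on the solution of

$$u_t = \tfrac{1}{2} \Delta u + \langle b, \nabla u \rangle$$

with $u(0) = \exp[\langle \theta, x \rangle]$. By the Hopf-Cole transformation
$v = \log u$, this reduces to estimating

$$v_t = \tfrac{1}{2} \Delta v + \tfrac{1}{2} \|\nabla v\|^2 + \langle b, \nabla v \rangle$$

with $v(0) = \langle \theta, x \rangle$. This can be done if we can construct a
subsolution

$$\tfrac{1}{2} \nabla \cdot w + \tfrac{1}{2} \|w\|^2 + \langle b, w \rangle \le \Psi(\theta)$$

on $\Omega$, where $w : \Omega \to \mathbb{R}^d$ satisfies
$\int w \, dP = \theta$ and $\nabla \times w = 0$ in the sense that
$D_i w_j = D_j w_i$. The existence of the subsolution comes from convex analysis.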



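$$\begin{aligned}
\Psi(\theta) &= \sup_{\substack{b', \phi \\ \frac{1}{2} \Delta \phi = \nabla \cdot (b' \phi)}} \left[\int \langle b', \theta \rangle \phi \, dP - \frac{1}{2} \int \|b - b'\|^2 \phi \, dP\right] \\
&= \sup_{\phi} \sup_{b'} \inf_{u} \left[\int \langle b', \theta \rangle \phi \, dP - \frac{1}{2} \int \|b - b'\|^2 \phi \, dP + \int \left[\tfrac{1}{2} \Delta u + \langle b', \nabla u \rangle\right] \phi \, dP\right] \\
&= \sup_{\phi} \inf_{u} \sup_{b'} \left[\int \langle b', \theta \rangle \phi \, dP - \frac{1}{2} \int \|b - b'\|^2 \phi \, dP + \int \left[\tfrac{1}{2} \Delta u + \langle b', \nabla u \rangle\right] \phi \, dP\right] \\
&= \sup_{\phi} \inf_{u} \sup_{b'} \left[\int \langle b', \theta + \nabla u \rangle \phi \, dP - \frac{1}{2} \int \|b - b'\|^2 \phi \, dP + \int \tfrac{1}{2} \Delta u \, \phi \, dP\right] \\
&= \sup_{\phi} \inf_{u} \int \left[\tfrac{1}{2} \Delta u + \langle b, \theta + \nabla u \rangle + \tfrac{1}{2} \|\theta + \nabla u\|^2\right] \phi \, dP \\
&= \sup_{\phi} \inf_{\substack{w:\ \nabla \times w = 0 \\ \int w \, dP = \theta}} \int \left[\tfrac{1}{2} \nabla \cdot w + \langle b, w \rangle + \tfrac{1}{2} \|w\|^2\right] \phi \, dP \\
&= \inf_{\substack{w:\ \nabla \times w = 0 \\ \int w \, dP = \theta}} \sup_{\phi} \int \left[\tfrac{1}{2} \nabla \cdot w + \langle b, w \rangle + \tfrac{1}{2} \|w\|^2\right] \phi \, dP \\
&= \inf_{\substack{w:\ \nabla \times w = 0 \\ \int w \, dP = \theta}} \sup_{\omega} \left[\tfrac{1}{2} \nabla \cdot w + \langle b, w \rangle + \tfrac{1}{2} \|w\|^2\right],
\end{aligned}$$

which proves the existence of a subsolution. One needs to justify the free
interchange of sup and inf. In passing from the first line to the second, the
restriction on $b'$ and $\phi$ is replaced by the Lagrange multiplier $u$.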

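This can be viewed as showing the existence of a limit as $\varepsilon \to 0$
(homogenization) of the solution of

$$u^\varepsilon_t = \frac{\varepsilon}{2} \Delta u^\varepsilon + \frac{1}{2} \|\nabla u^\varepsilon\|^2 + \left\langle b\left(\frac{x}{\varepsilon}, \omega\right), \nabla u^\varepsilon \right\rangle$$

with $u^\varepsilon(0, x) = f(x)$. The limit satisfies

$$u_t = \Psi(\nabla u)$$

with $u(0, x) = f(x)$.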



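11. Homogenization of Hamilton-Jacobi-Bellman equations. This can be
generalized to equations of the form

$$u^\varepsilon_t = \frac{\varepsilon}{2} \Delta u^\varepsilon + H\left(\frac{x}{\varepsilon}, \nabla u^\varepsilon, \omega\right), \qquad u^\varepsilon(0, x) = f(x),$$

where $H(x, p, \omega) = H(\tau_x \omega, p)$ is a stationary process of convex
functions in $p$. By the changes of variables $x = \varepsilon y$,
$u = \varepsilon v$, $t = \varepsilon \tau$, this reduces to the behavior of
$\varepsilon v^\varepsilon(\frac{t}{\varepsilon}, \frac{x}{\varepsilon})$, where
$v^\varepsilon$ solves

$$v^\varepsilon_\tau = \tfrac{1}{2} \Delta v^\varepsilon + H(y, \nabla v^\varepsilon, \omega), \qquad v^\varepsilon(0, y) = \varepsilon^{-1} f(\varepsilon y).$$

From the principle of dynamic programming, we can represent the solution as
the supremum of a family of solutions of linear equations,

$$v^\varepsilon(\tau, y) = \sup_{b(\cdot,\cdot) \in \mathcal{B}} w^\varepsilon(b(\cdot,\cdot), \varepsilon^{-1} T - \tau, y),$$

where $w^\varepsilon$ solves

$$w^\varepsilon_\tau + \tfrac{1}{2} \Delta w^\varepsilon + \langle b(\tau, y), \nabla w^\varepsilon \rangle - L(\tau_y \omega, b(\tau, y)) = 0,$$

$$w^\varepsilon(\varepsilon^{-1} T, y) = \varepsilon^{-1} f(\varepsilon y),$$

$L$ being the convex dual of $H$ with respect to the variable $p$. If $b$ is
chosen as $b(\tau_y \omega)$, for suitable choices of
$b(\omega) \in \mathcal{C}$ that admit positive integrable solutions $\phi$ to

$$\tfrac{1}{2} \Delta \phi(\tau_x \omega) = \nabla \cdot \left(b(\tau_x \omega) \phi(\tau_x \omega)\right),$$

then it is not hard to see that

$$\varepsilon w^\varepsilon(0, 0) \to f\left(T \int b(\omega) \phi(\omega) \, dP\right) - T \int L(b(\omega), \omega) \phi(\omega) \, dP.$$

This provides a lower bound

$$\liminf_{\varepsilon \to 0} u^\varepsilon(T, 0) \ge \sup_{b \in \mathcal{C}} \left[f\left(T \int b(\omega) \phi(\omega) \, dP\right) - T \int L(b(\omega), \omega) \phi(\omega) \, dP\right],$$

which can be shown to be an upper bound as well.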



12. History and references. The origin of large deviation theory goes
back to Scandinavian actuaries [10] who were interested in the analysis of
risk in the insurance industry. For sums of independent random variables,
the general large deviations result was established by Cramér in [1]. The
result for empirical distributions of independent identically distributed
random variables is due to Sanov [18]. The generalization to Markov chains
and processes can be found in several papers of Donsker and Varadhan
[3, 4, 5, 6] and Gärtner [13]. The results concerning small random
perturbations of deterministic systems go back to the work of Varadhan [20],
as well as Vencel and Freidlin [12]. Several monographs have appeared on
the subject: lecture notes by Varadhan [21] and texts by Deuschel and Stroock
[7], Dembo and Zeitouni [2], Schwartz and Weiss [19], Ellis [9], Dupuis and
Ellis [8] and, most recently, Feng and Kurtz [11]. They cover a wide
spectrum of topics in large deviation theory. For large deviations in the
context of hydrodynamic scaling, there is the text by Kipnis and Landim [14],
as well as an exposition by Varadhan [23]. As for large deviations for random
walks in a random environment, see [24], as well as references in Zeitouni's
article [25]. For a general survey on large deviations and entropy, see [22].
The results on homogenization of random Hamilton-Jacobi-Bellman equations
and their application to large deviations have appeared in [15] and [17].
Undoubtedly, there are many more references. A recent Google search on
"Large Deviations" produced 3.4 million hits.